{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "b7d4eabb-8d9c-47a1-b94f-f660da8cf0de",
   "metadata": {},
   "source": [
    "# 抖音用户数据集"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fc065329-f7bf-4812-9481-dbd84f478725",
   "metadata": {
    "jp-MarkdownHeadingCollapsed": true
   },
   "source": [
    "## 特征工程实践示例"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "70a343ab-b56d-4a8d-aace-0fd64edb56aa",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "\n",
    "# 加载数据\n",
    "file_name = \"user_personalized_features.csv\"\n",
    "df = pd.read_csv(file_name)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bfe47dce-92aa-4cfb-8ee6-6348e5e4e9d1",
   "metadata": {},
   "source": [
    "### 数据信息查看"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5f3230ec-4b94-460a-b840-234b8751df8a",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Initial Data Inspection\n",
    "print(\"Original Data Preview:\")\n",
    "print(df.head())\n",
    "print(\"\\nData Information:\")\n",
    "df.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "477ed778-90d9-4e10-a4f9-571844959b77",
   "metadata": {},
   "source": [
    "### 删除重复的索引字段\n",
    "删除原始数据中冗余的索引列Unnamed: 0.1和Unamed: 0。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e9431978-f682-47b0-9a3f-68ee11716834",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Drop redundant index columns\n",
    "df = df.drop(columns=['Unnamed: 0.1', 'Unnamed: 0'])\n",
    "\n",
    "print(\"DataFrame after dropping redundant columns:\")\n",
    "print(df.head())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5f1a24fc-a454-4531-b2a4-88c98e5d04fb",
   "metadata": {},
   "source": [
    "### 数值特征工程"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "80c730a6-d606-4b51-80ee-362753bf6d05",
   "metadata": {},
   "source": [
    "#### 数值离散化/分箱 (Binning)\n",
    "年龄分箱 (Age_Group): 将Age特征离散化为 <20、20-35、36-50、51-65、65+ 五个年龄段。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "85eba27f-1513-49ab-91b4-31e9aab5908b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Practice Case: Discretize Age into age groups.\n",
    "bins = [0, 20, 35, 50, 65, np.inf]\n",
    "labels = ['<20', '20-35', '36-50', '51-65', '65+']\n",
    "df['Age_Group'] = pd.cut(df['Age'], bins=bins, labels=labels, right=True)\n",
    "print(\"\\nAge_Group (Age Binning):\\n\", df[['Age', 'Age_Group']].head())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7448a1ec-3726-45f8-8a16-6243c5ea0d6f",
   "metadata": {},
   "source": [
    "#### 数值标准化/归一化 (Scaling)\n",
    "- 收入Min-Max归一化 (Income_MinMaxScaled): 将Income收入特征归一化到0-1范围，适用于需要统一数值尺度的模型。\n",
    "- 网站停留时间Z-score标准化 (Time_Spent_on_Site_Minutes_ZScaled): 对Time_Spent_on_Site_Minutes进行Z-score标准化，使其均值为0，标准差为1。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9294e387-5593-4566-ba88-3b6e92e93c80",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Practice Case: Standardize Income and Time_Spent_on_Site_Minutes.\n",
    "# Min-Max Scaling for Income to [0, 1]\n",
    "min_income = df['Income'].min()\n",
    "max_income = df['Income'].max()\n",
    "df['Income_MinMaxScaled'] = (df['Income'] - min_income) / (max_income - min_income)\n",
    "print(\"\\nIncome_MinMaxScaled:\\n\", df[['Income', 'Income_MinMaxScaled']].head())\n",
    "\n",
    "# Z-score Standardization for Time_Spent_on_Site_Minutes\n",
    "mean_time_spent = df['Time_Spent_on_Site_Minutes'].mean()\n",
    "std_time_spent = df['Time_Spent_on_Site_Minutes'].std()\n",
    "df['Time_Spent_on_Site_Minutes_ZScaled'] = (df['Time_Spent_on_Site_Minutes'] - mean_time_spent) / std_time_spent\n",
    "print(\"\\nTime_Spent_on_Site_Minutes_ZScaled:\\n\", df[['Time_Spent_on_Site_Minutes', 'Time_Spent_on_Site_Minutes_ZScaled']].head())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a881e7fd-2583-4c54-bcbd-1bd15160c703",
   "metadata": {},
   "source": [
    "#### 特征交叉/组合 (Feature Interaction/Combination)\n",
    "- 创建了 Income_Per_Minute_Spent，表示每分钟网站停留时间所对应的收入。\n",
    "- 创建了 Purchase_Value_Per_Product，表示每浏览页面对应的总消费。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5aec1865-f150-427b-bb88-f643e0161929",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Practice Case: Create new features like Income_Per_Minute_Spent and Purchase_Value_Per_Product.\n",
    "df['Income_Per_Minute_Spent'] = df['Income'] / (df['Time_Spent_on_Site_Minutes'] + 1e-6) # Add small epsilon to avoid division by zero\n",
    "print(\"\\nIncome_Per_Minute_Spent:\\n\", df[['Income', 'Time_Spent_on_Site_Minutes', 'Income_Per_Minute_Spent']].head())\n",
    "\n",
    "df['Purchase_Value_Per_Product'] = df['Total_Spending'] / (df['Pages_Viewed'] + 1e-6) # Assuming Pages_Viewed is a proxy for \"product interaction\" or just another numeric column to combine\n",
    "print(\"\\nPurchase_Value_Per_Product:\\n\", df[['Total_Spending', 'Pages_Viewed', 'Purchase_Value_Per_Product']].head())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5bf086d1-77df-4d9c-9a0f-f033fbdfa2a5",
   "metadata": {},
   "source": [
    "#### 多项式特征 (Polynomial Features)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c2aa943e-c8a7-4b51-8f80-1d3ca03ccdd8",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Practice Case: Create a squared term for Age.\n",
    "df['Age_Squared'] = df['Age'] ** 2\n",
    "print(\"\\nAge_Squared:\\n\", df[['Age', 'Age_Squared']].head())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ac7ce059-711e-4918-a822-eda1906bfe07",
   "metadata": {},
   "source": [
    "### 日期时间特征工程\n",
    "- 最后登录日期 (Last_Login_Date): 基于当前的日期 (2025-07-08) 和 Last_Login_Days_Ago 计算出精确的最后登录日期。\n",
    "- 日期时间组件提取: 从 Last_Login_Date 中提取了 Last_Login_Year (年)、Last_Login_Month (月)、Last_Login_Day (日) 和 Last_Login_DayOfWeek (星期几)。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6a8b8fd2-c6a3-428c-949f-92ed2eb3030d",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"\\n--- 5. Date/Time Feature Engineering ---\")\n",
    "\n",
    "# The current date is Tuesday, July 8, 2025.\n",
    "current_date = pd.to_datetime('2025-07-08')\n",
    "\n",
    "# Calculate Last_Login_Date\n",
    "df['Last_Login_Date'] = current_date - pd.to_timedelta(df['Last_Login_Days_Ago'], unit='D')\n",
    "print(\"\\nLast_Login_Date (derived from Last_Login_Days_Ago):\\n\", df[['Last_Login_Days_Ago', 'Last_Login_Date']].head())\n",
    "\n",
    "# Extract year, month, day, day of week from Last_Login_Date\n",
    "df['Last_Login_Year'] = df['Last_Login_Date'].dt.year\n",
    "df['Last_Login_Month'] = df['Last_Login_Date'].dt.month\n",
    "df['Last_Login_Day'] = df['Last_Login_Date'].dt.day\n",
    "df['Last_Login_DayOfWeek'] = df['Last_Login_Date'].dt.dayofweek # Monday=0, Sunday=6\n",
    "print(\"\\nExtracted Date/Time features:\\n\", df[['Last_Login_Date', 'Last_Login_Year', 'Last_Login_Month', 'Last_Login_Day', 'Last_Login_DayOfWeek']].head())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "07003fa2-289c-457d-9ea2-48e052bea549",
   "metadata": {},
   "source": [
    "### 布尔型值特征工程\n",
    "- 订阅状态编码 (Newsletter_Subscription_Encoded): 将布尔型的 Newsletter_Subscription 转换为数值型 (True 转换为 1，False 转换为 0)。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bac2b202-d433-4b42-bfbd-78c34da33d8f",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"\\n--- 6. Boolean Feature Engineering ---\")\n",
    "\n",
    "# Convert Newsletter_Subscription from boolean to integer (0 or 1)\n",
    "df['Newsletter_Subscription_Encoded'] = df['Newsletter_Subscription'].astype(int)\n",
    "print(\"\\nNewsletter_Subscription_Encoded:\\n\", df[['Newsletter_Subscription', 'Newsletter_Subscription_Encoded']].head())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "90a3070d-da49-4788-b3d2-c9aac4cea320",
   "metadata": {},
   "source": [
    "### 基于分组的特征工程\n",
    "- 地点平均总消费 (Location_Avg_Total_Spending): 计算了每个 Location (地点) 的用户平均 Total_Spending (总消费)。\n",
    "- 产品类别平均订单价值标准差 (Product_Category_Preference_Std_Avg_Order_Value): 计算了每个 Product_Category_Preference (产品类别偏好) 的 Average_Order_Value (平均订单价值) 的标准差。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "318bac34-e885-4c51-ab40-d4fd6ac9fb95",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"\\n--- 7. Group-based Feature Engineering ---\")\n",
    "\n",
    "# Calculate average Total_Spending per Location\n",
    "df['Location_Avg_Total_Spending'] = df.groupby('Location')['Total_Spending'].transform('mean')\n",
    "print(\"\\nLocation_Avg_Total_Spending:\\n\", df[['Location', 'Total_Spending', 'Location_Avg_Total_Spending']].head())\n",
    "\n",
    "# Calculate standard deviation of Average_Order_Value per Product_Category_Preference\n",
    "df['Product_Category_Preference_Std_Avg_Order_Value'] = df.groupby('Product_Category_Preference')['Average_Order_Value'].transform('std')\n",
    "print(\"\\nProduct_Category_Preference_Std_Avg_Order_Value:\\n\", df[['Product_Category_Preference', 'Average_Order_Value', 'Product_Category_Preference_Std_Avg_Order_Value']].head())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "651682ca-e3ad-4929-95d2-9fc387875abb",
   "metadata": {},
   "source": [
    "### 类别特征程"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c79d209f-a86c-4248-9584-4647bf78bca5",
   "metadata": {},
   "source": [
    "#### 独热编码(One-Hot Encoding)\n",
    "- 独热编码: 对Gender(性别) 和Location (地点) 进行了独热编码，将它们转换为多个二进制列。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "974d99e2-6b5b-430a-be43-027f2ecc6c16",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Apply to Gender and Location.\n",
    "df = pd.get_dummies(df, columns=['Gender', 'Location'], prefix=['Gender', 'Location'])\n",
    "print(\"\\nDataFrame after One-Hot Encoding (Gender, Location):\\n\", df.head())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "de14b76e-5383-4b2a-b8e5-e78cee473614",
   "metadata": {},
   "source": [
    "#### 频率编码(Frequency Encoding)\n",
    "- 对Product_Category_Preference进行了频率编码，用每个产品类别的出现频率替换其原始类别值。\n",
    "- 对Interests (兴趣) 进行了频率编码，用每个兴趣类别的出现频率替换其原始类别值。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "46a18b3b-c6bc-4737-b013-821cfa979e65",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Apply to Product_Category_Preference and Interests.\n",
    "product_category_frequency = df['Product_Category_Preference'].value_counts(normalize=True)\n",
    "df['Product_Category_Preference_FreqEncoded'] = df['Product_Category_Preference'].map(product_category_frequency)\n",
    "print(\"\\nProduct_Category_Preference_FreqEncoded:\\n\", df[['Product_Category_Preference', 'Product_Category_Preference_FreqEncoded']].head())\n",
    "\n",
    "interests_frequency = df['Interests'].value_counts(normalize=True)\n",
    "df['Interests_FreqEncoded'] = df['Interests'].map(interests_frequency)\n",
    "print(\"\\nInterests_FreqEncoded:\\n\", df[['Interests', 'Interests_FreqEncoded']].head())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4fcb5320-9932-410a-8de5-a4f3889a00e6",
   "metadata": {},
   "source": [
    "### 处理缺失值"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8fe3f139-1e51-44c2-983d-632b3b23b4e0",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"\\n--- 8. Handling Missing Values (Check only) ---\")\n",
    "print(\"\\nMissing values before any imputation:\\n\", df.isnull().sum())\n",
    "# 由info()的输出可知，该数据集没有缺失值，因此无须进行任何的缺失值处理操作。若存在缺失值的情况，可以使用类似如下的方法来进行缺失值填充。\n",
    "# df['Column'].fillna(df['Column'].mean(), inplace=True) "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "08ab1913-1b79-47e0-bca6-8d96db4caf24",
   "metadata": {},
   "source": [
    "### 处理异常值\n",
    "- 总消费封顶 (Total_Spending_Capped): 使用 IQR (四分位距) 方法检测并对 Total_Spending (总消费) 中的异常值进行了封顶处理，将超出上下限的值替换为相应的限制值。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "50aedf68-6223-4316-b99c-3ce70541a58b",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"\\n--- 9. Handling Outliers (Capping) ---\")\n",
    "\n",
    "# Practice Case: Outlier capping on Total_Spending using IQR.\n",
    "Q1_spending = df['Total_Spending'].quantile(0.25)\n",
    "Q3_spending = df['Total_Spending'].quantile(0.75)\n",
    "IQR_spending = Q3_spending - Q1_spending\n",
    "lower_bound_spending = Q1_spending - 1.5 * IQR_spending\n",
    "upper_bound_spending = Q3_spending + 1.5 * IQR_spending\n",
    "\n",
    "print(f\"\\nTotal_Spending IQR Detection: Q1={Q1_spending}, Q3={Q3_spending}, IQR={IQR_spending}\")\n",
    "print(f\"Lower Bound (Total_Spending): {lower_bound_spending}\")\n",
    "print(f\"Upper Bound (Total_Spending): {upper_bound_spending}\")\n",
    "\n",
    "# Apply capping\n",
    "df['Total_Spending_Capped'] = np.where(\n",
    "    df['Total_Spending'] < lower_bound_spending,\n",
    "    lower_bound_spending,\n",
    "    np.where(df['Total_Spending'] > upper_bound_spending, upper_bound_spending, df['Total_Spending'])\n",
    ")\n",
    "print(\"\\nTotal_Spending_Capped (first 10 rows to show potential changes):\\n\", df[['Total_Spending', 'Total_Spending_Capped']].head(10))\n",
    "\n",
    "# Final check of the DataFrame with new features\n",
    "print(\"\\nFinal DataFrame with engineered features preview:\")\n",
    "print(df.head())\n",
    "print(\"\\nFinal DataFrame Info:\")\n",
    "df.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "af1fdf6b-c347-46a3-9260-57f59d75d0af",
   "metadata": {},
   "source": [
    "### 保存处理后的结果"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "aedfe384-12a5-4948-bb32-9e6f838af301",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Save the transformed data to a CSV file\n",
    "df.to_csv('engineered_user_features.csv', index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "47050e56-7330-4cdd-b1a9-b72bd6c03d4f",
   "metadata": {},
   "source": [
    "## 绘图示例"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "677966b4-3aa9-4862-a055-951e40f38e6b",
   "metadata": {},
   "source": [
    "### 环境准备与数据加载"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "538855b4-0bdb-4bec-a957-84ed9385f5a4",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "\n",
    "# 加载上一节生成的特征工程后的数据\n",
    "df = pd.read_csv('engineered_user_features.csv')\n",
    "\n",
    "# 设置 Matplotlib 风格，让图表更美观\n",
    "plt.style.use('seaborn-v0_8-darkgrid')\n",
    "# 设置中文字体（根据你的系统选择合适的字体，例如 'SimHei' 或 'Microsoft YaHei'）\n",
    "plt.rcParams['font.sans-serif'] = ['Arial Unicode MS'] # For macOS\n",
    "# plt.rcParams['font.sans-serif'] = ['SimHei'] # For Windows/Linux\n",
    "plt.rcParams['axes.unicode_minus'] = False # 解决负号显示问题\n",
    "\n",
    "print(\"特征工程后数据预览:\")\n",
    "print(df.head())\n",
    "print(\"\\n数据信息:\")\n",
    "df.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "faafa139-81a1-454c-a92f-233b3740926e",
   "metadata": {},
   "source": [
    "### 单变量分布分析"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2c40bc0a-341a-441c-b459-1eb666016081",
   "metadata": {},
   "source": [
    "#### 直方图 (Histograms) - 数值特征分布\n",
    "- 直方图显示了数值数据在不同区间内的频率分布。\n",
    "- 示例：绘制Age和Income的直方图。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "90f0307d-b066-4e3b-86d7-c245dfedf560",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"\\n--- 2.1 直方图 (Histograms) ---\")\n",
    "\n",
    "plt.figure(figsize=(12, 5))\n",
    "\n",
    "plt.subplot(1, 2, 1) # 1行2列，第1个图\n",
    "sns.histplot(df['Age'], bins=10, kde=True, color='skyblue')\n",
    "plt.title('用户年龄分布')\n",
    "plt.xlabel('年龄')\n",
    "plt.ylabel('数量')\n",
    "\n",
    "plt.subplot(1, 2, 2) # 1行2列，第2个图\n",
    "sns.histplot(df['Income'], bins=15, kde=True, color='lightcoral')\n",
    "plt.title('用户收入分布')\n",
    "plt.xlabel('收入')\n",
    "plt.ylabel('数量')\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0ef77566-9edd-4feb-abed-f1dad4a47a9f",
   "metadata": {},
   "source": [
    "#### 条形图 (Bar Plots) - 类别特征分布\n",
    "- 条形图用于显示类别特征中每个类别的计数或频率。\n",
    "- 实践案例：绘制 Gender 和 Age_Group 的分布。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fe03cd3f-dc4e-4907-bd93-b94adc7ff291",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"\\n--- 2.2 条形图 (Bar Plots) ---\")\n",
    "\n",
    "plt.figure(figsize=(14, 6))\n",
    "\n",
    "plt.subplot(1, 2, 1)\n",
    "sns.countplot(data=df, x='Gender_Male', palette='viridis') # Using one-hot encoded Gender\n",
    "plt.xticks([0, 1], ['Female', 'Male']) # Manually set labels based on Gender_Male\n",
    "plt.title('用户性别分布')\n",
    "plt.xlabel('性别')\n",
    "plt.ylabel('数量')\n",
    "\n",
    "plt.subplot(1, 2, 2)\n",
    "sns.countplot(data=df, x='Age_Group', order=df['Age_Group'].value_counts().index, palette='magma')\n",
    "plt.title('用户年龄组分布')\n",
    "plt.xlabel('年龄组')\n",
    "plt.ylabel('数量')\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3402c8db-dd1d-4f29-a1be-f432fd76ad94",
   "metadata": {},
   "source": [
    "#### 箱线图 (Box Plots) - 数值特征的统计概览和异常值检测\n",
    "- 箱线图显示了数据的中位数、四分位数、离群点等，非常适合检测异常值。\n",
    "- 实践案例： 绘制 Total_Spending 的箱线图，观察异常值处理效果。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6f5d86db-730a-4701-9078-0b87aff36ba2",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"\\n--- 2.3 箱线图 (Box Plots) ---\")\n",
    "\n",
    "plt.figure(figsize=(12, 6))\n",
    "\n",
    "plt.subplot(1, 2, 1)\n",
    "sns.boxplot(y=df['Total_Spending'], color='lightgreen')\n",
    "plt.title('总消费分布 (原始)')\n",
    "plt.ylabel('总消费')\n",
    "\n",
    "plt.subplot(1, 2, 2)\n",
    "sns.boxplot(y=df['Total_Spending_Capped'], color='lightpink')\n",
    "plt.title('总消费分布 (封顶后)')\n",
    "plt.ylabel('总消费 (封顶)')\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d3af5af5-ab1c-43b4-9256-5713a2706f14",
   "metadata": {},
   "source": [
    "### 双变量关系分析\n",
    "- 探索两个特征之间的关系是发现有用模式的关键。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a0c1b8ab-7174-4eda-9a0f-dd32339255c7",
   "metadata": {},
   "source": [
    "#### 散点图 (Scatter Plots) - 数值特征之间关系\n",
    "- 散点图用于显示两个数值变量之间的关系。\n",
    "- 实践案例： 绘制 Income 和 Total_Spending 之间的关系，以及 Time_Spent_on_Site_Minutes 和 Pages_Viewed 之间的关系。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d7e91bc5-8e5f-40b0-8049-cd98cdd22bc7",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"\\n--- 3.1 散点图 (Scatter Plots) ---\")\n",
    "\n",
    "plt.figure(figsize=(14, 6))\n",
    "\n",
    "plt.subplot(1, 2, 1)\n",
    "sns.scatterplot(data=df, x='Income', y='Total_Spending', hue='Gender_Male', palette='coolwarm', alpha=0.7)\n",
    "plt.title('收入 vs. 总消费')\n",
    "plt.xlabel('收入')\n",
    "plt.ylabel('总消费')\n",
    "plt.legend(title='性别 (1=男, 0=女)')\n",
    "\n",
    "plt.subplot(1, 2, 2)\n",
    "sns.scatterplot(data=df, x='Time_Spent_on_Site_Minutes', y='Pages_Viewed', color='purple', alpha=0.6)\n",
    "plt.title('网站停留时间 vs. 浏览页面数')\n",
    "plt.xlabel('网站停留时间 (分钟)')\n",
    "plt.ylabel('浏览页面数')\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6fcd392f-033a-49fb-a200-309d41cabee4",
   "metadata": {},
   "source": [
    "#### 小提琴图 (Violin Plots) - 类别与数值特征关系\n",
    "- 小提琴图结合了箱线图和核密度估计图的优点，显示了数值特征在不同类别下的分布。\n",
    "- 实践案例： 绘制 Location 对 Income 的影响。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b328be69-e8ea-4cc5-8b6b-e60aef020cc1",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"\\n--- 3.2 小提琴图 (Violin Plots) ---\")\n",
    "\n",
    "# 重建一个临时的 'Location' 列，以便用其原始类别值进行绘图\n",
    "# 或者，直接使用 one-hot 编码的列进行分类对比\n",
    "# 为了展示清晰，这里假设我们能直接访问原始的 Location 列进行分组。\n",
    "# 由于原始数据中 Location 有 'Rural', 'Suburban', 'Urban'，我们使用这些。\n",
    "# 注意：在实际特征工程后，原始列可能已被替换。这里为了绘图方便，我们直接基于数值列和类别分组。\n",
    "# 如果原始 Location 列已被删除，则需要根据 one-hot encoded 列重新创建分类标签。\n",
    "\n",
    "# 从原始 df 中获取 Location 用于绘图 (或重新加载)\n",
    "df_plot_original_location = pd.read_csv('user_personalized_features.csv').drop(columns=['Unnamed: 0.1', 'Unnamed: 0'])\n",
    "\n",
    "plt.figure(figsize=(10, 6))\n",
    "sns.violinplot(data=df_plot_original_location, x='Location', y='Income', palette='Set2')\n",
    "plt.title('不同地点用户的收入分布')\n",
    "plt.xlabel('地点')\n",
    "plt.ylabel('收入')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3eebe459-6000-42b5-8aef-a74221d39d16",
   "metadata": {},
   "source": [
    "#### 分组柱状图 (Grouped Bar Charts) - 类别与类别特征关系\n",
    "- 用于比较不同类别组的计数或频率。\n",
    "- 实践案例： 绘制不同 Age_Group 中 Gender 的分布。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6dd5fdf1-6322-4dd4-a09b-2c340bd7046e",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"\\n--- 3.3 分组柱状图 (Grouped Bar Charts) ---\")\n",
    "\n",
    "# 使用melt函数将Gender_Female和Gender_Male转换成一个统一的性别列，方便绘图\n",
    "df_melted_gender_age = df.melt(id_vars=['Age_Group'], value_vars=['Gender_Female', 'Gender_Male'],\n",
    "                               var_name='Gender_Encoded', value_name='Count')\n",
    "# 筛选出 Count 为 1 的行，因为独热编码只有0和1\n",
    "df_melted_gender_age = df_melted_gender_age[df_melted_gender_age['Count'] == 1]\n",
    "df_melted_gender_age['Gender'] = df_melted_gender_age['Gender_Encoded'].apply(lambda x: 'Female' if 'Female' in x else 'Male')\n",
    "\n",
    "plt.figure(figsize=(12, 7))\n",
    "sns.countplot(data=df_melted_gender_age, x='Age_Group', hue='Gender', palette='pastel',\n",
    "              order=sorted(df['Age_Group'].dropna().unique().astype(str))) # Ensure correct order\n",
    "plt.title('不同年龄组的用户性别分布')\n",
    "plt.xlabel('年龄组')\n",
    "plt.ylabel('数量')\n",
    "plt.legend(title='性别')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "de1bea3c-de64-4d77-aff4-fdf8fab0968a",
   "metadata": {},
   "source": [
    "### 多变量关系分析\n",
    "- 探索三个或更多特征之间的关系，通常需要更复杂的图表或组合图。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d0304ad9-24cd-4fcc-ada6-0fc1dcc52ce5",
   "metadata": {},
   "source": [
    "#### 相关性热力图 (Correlation Heatmaps)\n",
    "- 相关性热力图显示了数据集中所有数值特征之间的相关性。这是发现高度相关（可能冗余）或与目标变量强相关特征的重要工具。\n",
    "- 实践案例： 绘制所有数值特征的相关性热力图。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d7ec5219-fad7-45a9-9669-142f390ced1d",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"\\n--- 4.1 相关性热力图 (Correlation Heatmaps) ---\")\n",
    "\n",
    "# 选择数值特征\n",
    "numeric_cols = df.select_dtypes(include=np.number).columns\n",
    "# 排除 User_ID 和可能不需要分析相关性的编码特征（如 Last_Login_Year/Month/Day）\n",
    "# 但为了全面展示，我们先包含大部分，再考虑排除\n",
    "features_for_correlation = [col for col in numeric_cols if col not in ['User_ID', 'Newsletter_Subscription_Encoded',\n",
    "                                                                        'Last_Login_Year', 'Last_Login_Month', 'Last_Login_Day', 'Last_Login_DayOfWeek',\n",
    "                                                                        'Gender_Female', 'Gender_Male', 'Location_Rural', 'Location_Suburban', 'Location_Urban']]\n",
    "correlation_matrix = df[features_for_correlation].corr()\n",
    "\n",
    "plt.figure(figsize=(14, 10))\n",
    "sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=\".2f\", linewidths=.5)\n",
    "plt.title('数值特征相关性热力图')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "06f284b9-1820-4377-8487-41147c3c86d5",
   "metadata": {},
   "source": [
    "#### Pair Plot (散点图矩阵)\n",
    "- sns.pairplot 可以绘制数据集中所有数值特征两两之间的散点图，并在对角线上绘制每个特征的分布图。\n",
    "- 实践案例： 对一些核心数值特征进行 Pair Plot 分析。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4c8de917-1c8a-42d6-8deb-a4b24e2027c8",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"\\n--- 4.2 Pair Plot (散点图矩阵) ---\")\n",
    "\n",
    "# 选择一些核心数值特征\n",
    "selected_features_for_pairplot = ['Age', 'Income', 'Total_Spending', 'Purchase_Frequency', 'Time_Spent_on_Site_Minutes']\n",
    "\n",
    "# 可以根据Gender进行着色\n",
    "sns.pairplot(df, vars=selected_features_for_pairplot, hue='Gender_Male', palette='viridis', diag_kind='kde')\n",
    "plt.suptitle('主要数值特征的散点图矩阵 (按性别着色)', y=1.02) # y调整标题位置\n",
    "plt.show()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
